Craftcans.com provides a database of 2692 canned craft beers. The data on the beers includes several variables, among them the can size (SIZE), the alcohol by volume (ABV) and the bitterness (IBUs).
However, some of the variables mix numbers and text (e.g. SIZE), while others contain missing values (e.g. IBUs). To make the dataset ready for analysis, we first need to clean it. We will do that using pandas and regular expressions.
In [2]:
import pandas, re
In [3]:
data = pandas.read_excel("craftcans.xlsx")
In [4]:
data.head()
Out[4]:
As can be seen above, the real header row was read in as an ordinary observation. Let's rename the columns using those values.
In [5]:
data.columns = data.iloc[0]
In [6]:
data.head()
Out[6]:
Fine, but the very first row still remains. We have to drop it, and for that we will use the drop() function from pandas, which takes two arguments: the label of the row/column to drop and the axis (0 for rows, 1 for columns).
In [7]:
data = data.drop(0,axis=0)
In [8]:
data.head()
Out[8]:
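As a side note, those two steps can often be avoided at read time: read_excel() accepts a header argument that tells pandas which row of the sheet holds the column names. A minimal sketch, assuming the real header sits in the second row of craftcans.xlsx (the exact row number depends on your copy of the file):
# read the file again, treating the second sheet row as the header
data = pandas.read_excel("craftcans.xlsx", header=1)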
Let's do the same for the row names. Row labels form the index in pandas. Thus, let's take the values from the "ENTRY" column and use them to rename the rows. Then, of course, we should drop the now-redundant column too.
In [9]:
data.index = data["ENTRY"]
In [10]:
data.head()
Out[10]:
In [11]:
data = data.drop("ENTRY",axis=1)
In [12]:
data.head()
Out[12]:
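For reference, the two cells above (moving ENTRY into the row labels and then dropping it as a column) could also be done with a single set_index() call. This is just an equivalent sketch, not what was run here:
# use the ENTRY column as the index and remove it from the columns in one step
data = data.set_index("ENTRY")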
Nice, now let's clean some variables, starting with SIZE. It stores the can size, written with "oz", "ounces" or other variants of the unit, while we need the numbers only. Let's first see what the available options are. For that purpose, we can convert the column to a list and then use the set() function to get its unique values.
In [13]:
data_list = data["SIZE"].tolist()
unique_values = set(data_list)
print(unique_values)
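By the way, the same check can be done without leaving pandas: Series.unique() returns the distinct values of a column directly, so the list/set round trip is optional.
# equivalent check using pandas directly
print(data["SIZE"].unique())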
Excellent. This means we can write a regular expression that keeps the numeric part (including values with a decimal point) and substitutes whatever comes after it with an empty string.
In [14]:
# keep only the leading number in each SIZE value and drop the unit text after it
for i in range(0, len(data['SIZE'])):
    data['SIZE'][i] = re.sub(r'(^.*\d)(\s*.*$)', r'\1', data['SIZE'][i])
In [15]:
data.head()
Out[15]:
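If you prefer to avoid the explicit loop, the same substitution can be written in a vectorized form with the .str accessor. A rough sketch, assuming the column holds strings and that your pandas version supports the regex flag of str.replace():
# vectorized alternative to the loop above: keep everything up to the last digit
# and drop the unit text ("oz", "ounces", ...) that follows it
data['SIZE'] = data['SIZE'].astype(str).str.replace(r'(^.*\d)(\s*.*$)', r'\1', regex=True)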
Done! Let's now move on to the ABV variable. It is given in percentages, so we can keep the number only and divide it by 100 to get a float value. But the column may also contain some incorrectly entered values, so let's convert only the values that match the expected pattern and assign a missing value to the rest.
In [17]:
# for all values in that column
for i in range(0, len(data['ABV'])):
    # if a match exists, which means it is a correct value
    if re.match(r'(^.*\d)(%)', data['ABV'][i]) is not None:
        # substitute the % sign with nothing, convert the result to float and divide by 100
        data['ABV'][i] = float(re.sub(r'(^.*\d)(%)', r'\1', data['ABV'][i])) / 100
    else:  # which is when the value is incorrect
        # give it the value of "nan", which stands for a missing value
        data['ABV'][i] = float("nan")
In [18]:
data['ABV'].head(100)
Out[18]:
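Had we wanted to skip the loop here as well, the raw ABV column could have been cleaned in one vectorized expression. A hedged sketch using to_numeric(), which turns anything that is not a valid number into NaN for us (this assumes it is applied to the original, still-uncleaned column):
# vectorized alternative: strip the trailing "%", let to_numeric() turn
# malformed entries into NaN, then rescale to a 0-1 float
data['ABV'] = pandas.to_numeric(data['ABV'].astype(str).str.rstrip('%'), errors='coerce') / 100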
Great! Let's now get some info on our dataframe.
In [19]:
data.info()
As you can see, the ABV variable is left with only 2348 values out of 2410, as we assigned "nan" to the incorrect ones. Let's impute those missing values. As it is a numeric variable, we can impute with the mean using the fillna() function from pandas.
In [20]:
data['ABV'] = data['ABV'].fillna(data['ABV'].mean())
In [21]:
data.info()
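To double-check that the imputation worked, counting the remaining missing values in the column should now return zero:
# sanity check: no missing values should remain in ABV after the imputation
print(data['ABV'].isnull().sum())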
Done! But there is another variable with missing values: IBUs. Let's impute that one as well, but this time, instead of the mean, let's use the backward filling method, which fills each gap with the next observed value in the column.
In [22]:
data['IBUs'] = data['IBUs'].fillna(method = "bfill")
In [23]:
data.info()
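One caveat with backward filling: if the very last observations of the column are missing, there is nothing below them to copy from, so some NaNs may survive. A common pattern, sketched here rather than taken from the original notebook, is to chain a forward fill after the backward fill:
# backward fill first, then forward fill whatever gaps remain at the end of the column
data['IBUs'] = data['IBUs'].fillna(method="bfill").fillna(method="ffill")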